\documentclass[10pt,twocolumn,letterpaper]{article}

\usepackage{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}

% Include other packages here, before hyperref.

% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex.  (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[pagebackref=true,breaklinks=true,letterpaper=true,colorlinks,bookmarks=false]{hyperref}


\cvprfinalcopy % *** Uncomment this line for the final submission

\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}

% Pages are numbered in submission mode, and unnumbered in camera-ready
\ifcvprfinal\pagestyle{empty}\fi
\begin{document}

%%%%%%%%% TITLE
\title{Project Status Report: Tiny Motion Capture System using Wii Remote}

\author{Wei Zhang, Yi Gong\\
Department of Computer Science\\
University of California, Santa Barbara\\
{\tt\small wei, ygong@cs.ucsb.edu}\\
% For a paper whose authors are all at the same institution,
% omit the following lines up until the closing ``}''.
% Additional authors and addresses can be added with ``\and'',
% just like the second author.
% To save space, use either the email address or home page, not both
%\and
%Second Author\\
%Institution2\\
%First line of institution2 address\\
%{\small\url{http://www.cs.ucsb.edu/~wei}}
}

\maketitle
\thispagestyle{empty}

%%%%%%%%% ABSTRACT
%\begin{abstract}

%\end{abstract}

%%%%%%%%% BODY TEXT
\section{Introduction}
Motion capture has been a well-studied area for decades, and
many commercial systems are available on the market. However, most of them 
are complicated and very expensive, and are built only for professionals.

The Nintendo Wii console has a built-in IR sensor in its remote
controller, with on-board hardware processing that can track
up to four IR markers simultaneously. This gives us the possibility
of building a very inexpensive motion capture system: 
considering that the Wii has
already been sold to over 20M families, 
our system would cost those users only a few IR markers. 
Such a low-cost, tiny motion capture system can serve as a basis 
for many new applications in home interactive entertainment.

Several interesting applications of the Wii remote have been developed 
and have had great impact; the most famous are the head tracking
and multi-touch whiteboard systems developed by Johnny Lee at CMU~\cite{JohnnyVR,JohnnyTouch}.
However, all current applications use only a single Wii remote for 
its built-in 2D tracking capability.

In our project we are going to build a 3D tracking system using two Wii
remotes. This is a classic problem of 3D reconstruction from multiple views, 
which has been explored by many researchers. For the two-view case, 
many studies have addressed the fundamental 
matrix and calibrated or uncalibrated image matching~\cite{luong95,Higgins87,Zhang95,Hartley95}; 
\cite{Hartley03} is also a good book that covers these issues. 
In our method, we first calibrate our cameras using the algorithm 
of \cite{Zhang00}, then calculate the fundamental matrix 
and reconstruct the 3D positions following \cite{Faugeras92}.
%------------------------------------------------------------------------

\section{Our Approach}

%------------------------------------------------------------------------

\subsection{Hardware Platform}
Our system consists of two Wii remotes used as IR cameras and four IR LEDs used as markers.
Every Wii remote contains a 1024$\times$768 infrared camera 
with built-in on-board processing that 
tracks up to 4 points (e.g., IR LEDs) at 100\,Hz. 
In addition, the Wii remote can be connected to a PC via Bluetooth. 
Each marker is just a regular IR LED attached to an AA battery; 
the Wii remote is most sensitive to wavelengths between 800 and 950\,nm.
%-------------------------------------------------------------------------

\subsection{Software Infrastructure}
We want to build our software as an open 3D tracking library so that 
other developers can use it to build their own applications. In this project, 
the software infrastructure can be described, from bottom to top, as follows:
\begin{itemize}
\item The bottom level is WiiRemoteLib. This is the hardware driver that enables
us to communicate with and control the Wii remote over Bluetooth. 
The ``image'' we get from the Wii remote is actually a set of point coordinates
in the camera's image plane.
\item Camera calibration provides the intrinsic and extrinsic parameters 
of the two cameras during initialization.
\item At the stereo vision level, we use the camera parameters and the detected marker positions
to compute the markers' 3D positions.
\item Finally, the tracking results are provided to our interactive application, 
for example, to control a character's movement in an FPS game based on 
specific postures of the player.
\end{itemize}

%-------------------------------------------------------------------------
\section{3D Reconstruction Based on Two Views}
The 3D reconstruction in our system can be divided into three steps: camera calibration, point pair matching, and 3D position calculation. 
\subsection{Camera Calibration}
Although many researchers have explored methods for the multiple-view stereo problem based on uncalibrated cameras, these methods need at least 7 marks~\cite{7points}, since the fundamental matrix has 7 degrees of freedom. Because the Wii driver is limited to 4 marks, we cannot use enough marks to implement our system without camera calibration. Therefore, we make use of Zhang's calibration method by calling the corresponding OpenCV routine. In the first stage of our method, we place a measured square plane, with four infrared LEDs attached at its corners, in front of the two infrared cameras. For each camera, our application analyzes the locations of the four 2D points and automatically binds them to the corresponding predefined 3D coordinates. It then performs the calibration and obtains the rotation matrices $R_{1}$, $R_{2}$ and translation vectors $t_{1}$, $t_{2}$. To simplify the later essential matrix calculation and 3D reconstruction, we apply a world transformation that changes the transformation matrix of the first camera from $( R_{1}~ | ~t_{1} )$ to $( I~ | ~0 )$. It can be shown that the transformation matrix of the second camera then becomes 
\begin{equation}( R_{2} R_{1}^{-1}~ | ~t_{2} - R_{2} R_{1}^{-1} t_{1} )\end{equation} after this world transformation. 
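
As a brief sketch of why this holds, let $X$ be a point in the original world frame and $X_{c}$ the same point in the new world frame (symbols introduced here for illustration only). Choosing the new world frame to coincide with the first camera frame means
\[ X_{c} = R_{1} X + t_{1} \quad\Longrightarrow\quad X = R_{1}^{-1}\left( X_{c} - t_{1} \right). \]
Substituting this into the second camera's transformation gives
\[ R_{2} X + t_{2} = R_{2} R_{1}^{-1} X_{c} + \left( t_{2} - R_{2} R_{1}^{-1} t_{1} \right), \]
which matches the expression above. Since $R_{1}$ is a rotation matrix, $R_{1}^{-1} = R_{1}^{T}$, so no explicit matrix inversion should be needed in practice.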

\subsection{Point Pair Matching}
In the point pair matching stage, we make use of epipolar geometry to find the one-to-one correspondence between the two groups of points seen in the two cameras' views. This correspondence is built on the essential matrix $E$, which is similar to the fundamental matrix but applies to normalized image coordinates: \begin{equation}x'^{T}_{n}~E~x_{n}=0\end{equation} Here $x_{n}$ and $x'_{n}$ are the normalized image coordinates of the same 3D point in view 1 and view 2. They are obtained by applying the inverse intrinsic matrices of the cameras to the original homogeneous image coordinates $x_{o}, x'_{o}$: \begin{equation}x_{n} = K_{1}^{-1}x_{o} \quad\text{and}\quad x'_{n} = K_{2}^{-1}x'_{o}\end{equation} where $K_{1}$ and $K_{2}$ are the intrinsic matrices of the two cameras.
The essential matrix can be calculated from the translation vector and rotation matrix between the two cameras: \begin{equation}E = [t]_{\times}R\end{equation}
As mentioned in the previous subsection, the rotation matrix and translation vector between our two cameras are: \begin{equation}R = R_{2} R_{1}^{-1}\end{equation} \begin{equation}t = t_{2} - R_{2} R_{1}^{-1} t_{1}\end{equation} 
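
For reference, $[t]_{\times}$ denotes the skew-symmetric cross-product matrix of $t$. Writing $t = (t_{x}, t_{y}, t_{z})^{T}$ (component names assumed here for illustration), it can be written out as
\[ [t]_{\times} = \left[ \begin{array}{ccc} 0 & -t_{z} & t_{y} \\ t_{z} & 0 & -t_{x} \\ -t_{y} & t_{x} & 0 \end{array} \right], \qquad [t]_{\times} v = t \times v \quad \text{for any 3-vector } v. \]
Because $[t]_{\times} t = 0$, this matrix has rank 2 whenever $t \neq 0$, and therefore so does $E$.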

Given the essential matrix, Equation 2 states the property that corresponding point pairs must satisfy. Recalling that \begin{equation}l' = Ex_{n}\end{equation} we can see the geometric meaning of Equation 2: the point $x'_{n}$ corresponding to $x_{n}$ should lie on the epipolar line $l'$ in the second view. 
However, this requirement is met exactly only in the ideal situation. Noise and errors introduced throughout the process make this property unreliable. The problem therefore becomes a search for an optimal solution based on energy minimization. The energy is defined as a function of the distance from each 2D point to the epipolar line of its pair: \begin{equation}d(x'_{n},l')^2+d(x_{n},l)^2\end{equation} where $l = E^{T}x'_{n}$ is the epipolar line of $x'_{n}$ in the first view. We try all possible point pairs and keep the pairing that minimizes the energy.
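
For concreteness, the point-to-line distance used in this energy can be written out. For a homogeneous point $p = (u, v, 1)^{T}$ and a line $m = (a, b, c)^{T}$ (generic symbols introduced for this sketch),
\[ d(p, m) = \frac{\left| a u + b v + c \right|}{\sqrt{a^{2} + b^{2}}} = \frac{\left| m^{T} p \right|}{\sqrt{a^{2} + b^{2}}}. \]
For a point $x'_{n}$ in the second view and the epipolar line $E x_{n}$ of a candidate partner $x_{n}$, the numerator is $\left| x'^{T}_{n} E x_{n} \right|$, that is, the magnitude of the residual of the epipolar constraint. If the matching is treated as a one-to-one assignment between the two views (our reading of the search described above), four markers give at most $4! = 24$ candidate assignments, so an exhaustive search should be inexpensive.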

\subsection{3D Coordinates Calculation}
To calculate the 3D position from $x$ and $x'$, we use the inhomogeneous linear triangulation method~\cite{multipleview}. Knowing that $x = PX$ and $x' = P'X$, where $P$ and $P'$ are the projection matrices of camera 1 and camera 2, we can apply the fact that $x\times(PX) = 0$ and $x'\times(P'X) = 0$. Each of these gives three linear equations, of which only two are linearly independent, so one can be eliminated. Together they give four linear equations. Reorganizing the coefficients and variables, we obtain a linear system of the form $AX = 0$, where
\begin{equation}A = \left [ \begin{array}{c} xp^{3T} - p^{1T} \\ yp^{3T} - p^{2T} \\x'p'^{3T} - p'^{1T} \\ y'p'^{3T} - p'^{2T}\end{array} \right ] \end{equation}
Here $p^{iT}$ denotes the $i$th row of $P$, and similarly $p'^{iT}$ denotes the $i$th row of $P'$; $(x, y)$ and $(x', y')$ are the image coordinates of the point in the two views. Replacing $X$ with $(X,Y,Z,1)^{T}$, we can reduce this $4\times4$ homogeneous system to a $4\times3$ inhomogeneous one. The right-hand-side vector $b$ is simply the negative of the fourth column of the original matrix $A$. This inhomogeneous system can be solved by many mature factorization methods, such as SVD or QR decomposition, which give the least-squares solution if $b$ is not in the range space of the $4\times3$ coefficient matrix. This solution gives us the 3D coordinates corresponding to the 2D point pair $x$ and $x'$.
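
Written out explicitly, as a sketch under the notation above with $a_{j}$ denoting the $j$th column of $A$ and $\tilde{X} = (X, Y, Z)^{T}$ (both symbols introduced here for illustration), the inhomogeneous system is
\[ \tilde{A}\,\tilde{X} = b, \qquad \tilde{A} = \left[ \begin{array}{ccc} a_{1} & a_{2} & a_{3} \end{array} \right], \qquad b = -a_{4}. \]
When $\tilde{A}$ has full column rank, the least-squares solution satisfies the normal equations
\[ \tilde{A}^{T} \tilde{A}\, \tilde{X} = \tilde{A}^{T} b, \]
although solving through SVD or QR decomposition is usually preferable numerically. Note that fixing the fourth homogeneous coordinate to 1 assumes the reconstructed point is not at infinity, which should hold for markers placed within the working volume.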

%\enumerate{
%\item 
%}
%\endenumerate

%Each pair corresponds to the projections of the
% same scene point. Unlike common commercial motion capture systems 
%using a cloud of markers, we use only four, so most of time the basic
% technique for spatial correspondence will work just fine. But when there 
%are some special cases happen, for example, two epipolar lines constructed
%from image1 are very close in image2, or two points in image2 are too close
%to the same epipolar line, then temporal correspondence information can 
%be used to improve the correctness of matching.



%\subsection{Spatial Correspondence}
%In the 3D reconstruction step, we will find the 
%correspondence between points pairs on the two image 
%plane by the algorithm based on Epipolar Geometry. 
%From the definition of Epipolar Geometry, 
%we can see that the projections of point p onto 
%two image plane, p1 and p2, have the relationship 
%that:\equation{p^{T}Fp'^{-1} = 0}\endequation
%
%where F is a 3 x 3, rand 2 matrix, which is also 
%called fundamental matrix\cite{Faugeras92}. Fundamental matrix has 
%proven to be related with essential matrix E:\equation{E = t\times R}\endequation, 
%where R and t is the rotation and translation from first camera to second camera.
%
%Since we''ve get our cameras calibrated, the calculation of 
%the fundamental matrix becomes straightforward. After 
%getting the fundamental matrix, we can relate the 
%tracked points in two images from the two cameras. 
%After finding the correspondence between the points 
%in two image planes, calculating the 3D position of marks is not hard to handle. 
%%-------------------------------------------------------------------------
%
%\subsection{Temporal Correspondence}
%The temporal correspondence problem (tracking) involves matching 
%two sets of 3D points representing detected markers at two consecutive
% frames, respectively. Given the correspondence between consecutive
% frames, a time series of 3D coordinates is built.
%
%In order to match points in image to objects, we must first estimate 
%where the objects are supposed to be. Since our system are used on low-speed
%human motions, the mathematical model of object movement can be simulated with a 
%uniformly accelrated rectilinear motion. Based on this model, there exist 
%many object tracking and position estimation algorithm, we plan to use the
%classic Kalman filter to calculate the estimation.
%
%So our stereo matching algorithm can be described as follows: For each 
%time {\em t}, we calculate the position prediction for every points based 
%on its previous position and estimation of speed. 
%Then we project the points in image1 to epipolar lines in image2, 
%and match the points in image2 to its corresponding lines to build the 
%spatial correspondency. Finally we project the estimated
%position to image2, match them with the points in images to build the 
%correspondency between points and physical objects.
%-------------------------------------------------------------------------

%\section{Conclusion}
\section{Planned Experiments}
Throughout our experiments, we will use two cameras. The two cameras will be placed 1--2\,m apart, with their viewing directions differing by about 90 degrees around the $y$ axis. Most test cases will take place in the intersection of the visible regions of the two cameras, but we will also test cases in which some marks leave this region for a while, to check the robustness of our system.


Our experiments will consist of several parts:

\begin{enumerate}
\item Single, static mark localization. This test mainly measures the correctness of our 3D reconstruction algorithm. We will first place one LED at several positions in the scene (within a reasonable visible region of the two cameras) and calculate its 3D location at each position with our application. We will then physically measure the true coordinates of the LED mark and compare them with the calculated results. The average error should not be larger than 0.25\,m.

\item Tracking and localization of multiple moving marks without loss of visibility. In this test, we will attach 4 LED marks to one or both of a person's arms or legs. We will then ask the person to move the arms or legs at slow, medium, and fast speeds, with smooth or jittery motion, and check whether all marks are tracked correctly from frame to frame and localized correctly. Since the scene is dynamic, we can only estimate correctness visually while the marks are moving and take physical measurements once the marks stop moving. The result should therefore be visually correct in about 90\% of frames, and the average measurement error should also be smaller than 0.25\,m.

\item Tracking and localization of multiple moving marks with occlusion or out-of-view marks. In this case, we will allow marks to become invisible for a while during their movement. We will test with 1 to 4 marks invisible in one or both views to check whether the motion estimation gives a reasonable guess in these cases. The expected result is that if 1 or 2 marks with smooth movement are invisible for fewer than 30 consecutive frames, we should still be able to give a correct result. We will also test jittery motion, in the hope that the system handles it as well.

\item Virtual object control. This is actually not a test but an application. We will place some virtual objects in a virtual scene and control them by moving an arm or hand. A simple game can be developed on this basis.

\item Pose and motion estimation (optional). We will define several patterns or parameters (such as location and velocity) for possible poses and motions of a person's arms and legs. We will then analyze the marks and try to match them against these patterns to estimate the person's pose and motion. The expected poses and motions include jumping, walking, running, and squatting.
\end{enumerate}

%Our system is expected to work well when get perfect input: no occlusion and motion is smooth with normal speed. We will try to %strengthen our application by estimate points' position from previous acceleration and march the most likely point when searching %correspondence, so that it can handle short and partial occlusion. 
%-------------------------------------------------------------------------

{\small
\bibliography{egbib}
\bibliographystyle{ieee}
}

%-------------------------------------------------------------------------
\end{document}
